分散注意力的驾驶每年会导致数千人死亡,以及如何应用深度学习的方法来防止这些悲剧已成为一个关键问题。在第六AI城市挑战赛的Track3中,研究人员提供了一个具有密集动作注释的高质量视频数据集。由于数据量表和不清楚的动作边界,数据集提出了一个独特的挑战,可以精确地本地化所有不同的动作并对其类别进行分类。在本文中,我们充分利用了视频之间的多视图同步,并进行了强大的多视图实践(MVP)来驱动动作本地化。为了避免过度拟合,我们将Slowfast用动力学-700预训练作为特征提取器进行微调。然后,不同视图的功能将传递给ActionFormer,以生成候选行动建议。为了精确地本地化所有动作,我们设计了精心设计的后处理,包括模型投票,阈值过滤和删除重复。结果表明,我们的MVP对于驱动动作定位是可靠的,在Track3测试集中达到28.49%的F1分数。
translated by 谷歌翻译
在监控视频中的异常检测是挑战,对确保公共安全有挑战性。不同于基于像素的异常检测方法,基于姿势的方法利用高结构化的骨架数据,这降低了计算负担,并避免了背景噪声的负面影响。然而,与基于像素的方法不同,这可以直接利用显式运动特征,例如光学流,基于姿势的方法缺乏替代动态表示。在本文中,提出了一种新的运动嵌入器(ME)以提供从概率的角度来提供姿态运动表示。此外,为自我监控姿势序列重建部署了一种新型任务特定的空间 - 时间变压器(STT)。然后将这两个模块集成到统一规律学习的统一框架中,该框架被称为运动先前规律学习者(MOPLL)。 MOPRL在几个具有挑战性的数据集中实现了4.7%AUC的平均改善,实现了最先进的性能。广泛的实验验证每个提出的模块的多功能性。
translated by 谷歌翻译
高效的时空建模是视频动作识别的重要而挑战性问题。现有的最先进的方法利用相邻的特征差异,以获得短期时间建模的运动线索,简单的卷积。然而,只有一个本地卷积,由于接收领域有限而无法处理各种动作。此外,摄像机运动带来的动作耳鸣还将损害提取的运动功能的质量。在本文中,我们提出了一个时间显着积分(TSI)块,其主要包含突出运动激励(SME)模块和交叉感知时间集成(CTI)模块。具体地,中小企业旨在通过空间级局部 - 全局运动建模突出显示运动敏感区域,其中显着对准和金字塔型运动建模在相邻帧之间连续进行,以捕获由未对准背景引起的噪声较少的运动动态。 CTI旨在分别通过一组单独的1D卷积进行多感知时间建模。同时,不同看法的时间相互作用与注意机制相结合。通过这两个模块,通过引入有限的附加参数,可以有效地编码长短的短期时间关系。在几个流行的基准测试中进行了广泛的实验(即,某种东西 - 某种东西 - 东西 - 400,uCF-101和HMDB-51),这证明了我们所提出的方法的有效性。
translated by 谷歌翻译
Temporal action proposal generation is an important yet challenging problem, since temporal proposals with rich action content are indispensable for analysing real-world videos with long duration and high proportion irrelevant content. This problem requires methods not only generating proposals with precise temporal boundaries, but also retrieving proposals to cover truth action instances with high recall and high overlap using relatively fewer proposals. To address these difficulties, we introduce an effective proposal generation method, named Boundary-Sensitive Network (BSN), which adopts "local to global" fashion. Locally, BSN first locates temporal boundaries with high probabilities, then directly combines these boundaries as proposals. Globally, with Boundary-Sensitive Proposal feature, BSN retrieves proposals by evaluating the confidence of whether a proposal contains an action within its region. We conduct experiments on two challenging datasets: ActivityNet-1.3 and THUMOS14, where BSN outperforms other state-of-the-art temporal action proposal generation methods with high recall and high temporal precision. Finally, further experiments demonstrate that by combining existing action classifiers, our method significantly improves the state-of-the-art temporal action detection performance.
translated by 谷歌翻译
Participants in political discourse employ rhetorical strategies -- such as hedging, attributions, or denials -- to display varying degrees of belief commitments to claims proposed by themselves or others. Traditionally, political scientists have studied these epistemic phenomena through labor-intensive manual content analysis. We propose to help automate such work through epistemic stance prediction, drawn from research in computational semantics, to distinguish at the clausal level what is asserted, denied, or only ambivalently suggested by the author or other mentioned entities (belief holders). We first develop a simple RoBERTa-based model for multi-source stance predictions that outperforms more complex state-of-the-art modeling. Then we demonstrate its novel application to political science by conducting a large-scale analysis of the Mass Market Manifestos corpus of U.S. political opinion books, where we characterize trends in cited belief holders -- respected allies and opposed bogeymen -- across U.S. political ideologies.
translated by 谷歌翻译
While inferring common actor states (such as position or velocity) is an important and well-explored task of the perception system aboard a self-driving vehicle (SDV), it may not always provide sufficient information to the SDV. This is especially true in the case of active emergency vehicles (EVs), where light-based signals also need to be captured to provide a full context. We consider this problem and propose a sequential methodology for the detection of active EVs, using an off-the-shelf CNN model operating at a frame level and a downstream smoother that accounts for the temporal aspect of flashing EV lights. We also explore model improvements through data augmentation and training with additional hard samples.
translated by 谷歌翻译
A key feature of federated learning (FL) is to preserve the data privacy of end users. However, there still exist potential privacy leakage in exchanging gradients under FL. As a result, recent research often explores the differential privacy (DP) approaches to add noises to the computing results to address privacy concerns with low overheads, which however degrade the model performance. In this paper, we strike the balance of data privacy and efficiency by utilizing the pervasive social connections between users. Specifically, we propose SCFL, a novel Social-aware Clustered Federated Learning scheme, where mutually trusted individuals can freely form a social cluster and aggregate their raw model updates (e.g., gradients) inside each cluster before uploading to the cloud for global aggregation. By mixing model updates in a social group, adversaries can only eavesdrop the social-layer combined results, but not the privacy of individuals. We unfold the design of SCFL in three steps. \emph{i) Stable social cluster formation. Considering users' heterogeneous training samples and data distributions, we formulate the optimal social cluster formation problem as a federation game and devise a fair revenue allocation mechanism to resist free-riders. ii) Differentiated trust-privacy mapping}. For the clusters with low mutual trust, we design a customizable privacy preservation mechanism to adaptively sanitize participants' model updates depending on social trust degrees. iii) Distributed convergence}. A distributed two-sided matching algorithm is devised to attain an optimized disjoint partition with Nash-stable convergence. Experiments on Facebook network and MNIST/CIFAR-10 datasets validate that our SCFL can effectively enhance learning utility, improve user payoff, and enforce customizable privacy protection.
translated by 谷歌翻译
Transformer-based models have been widely demonstrated to be successful in computer vision tasks by modelling long-range dependencies and capturing global representations. However, they are often dominated by features of large patterns leading to the loss of local details (e.g., boundaries and small objects), which are critical in medical image segmentation. To alleviate this problem, we propose a Dual-Aggregation Transformer Network called DuAT, which is characterized by two innovative designs, namely, the Global-to-Local Spatial Aggregation (GLSA) and Selective Boundary Aggregation (SBA) modules. The GLSA has the ability to aggregate and represent both global and local spatial features, which are beneficial for locating large and small objects, respectively. The SBA module is used to aggregate the boundary characteristic from low-level features and semantic information from high-level features for better preserving boundary details and locating the re-calibration objects. Extensive experiments in six benchmark datasets demonstrate that our proposed model outperforms state-of-the-art methods in the segmentation of skin lesion images, and polyps in colonoscopy images. In addition, our approach is more robust than existing methods in various challenging situations such as small object segmentation and ambiguous object boundaries.
translated by 谷歌翻译
Users' involvement in creating and propagating news is a vital aspect of fake news detection in online social networks. Intuitively, credible users are more likely to share trustworthy news, while untrusted users have a higher probability of spreading untrustworthy news. In this paper, we construct a dual-layer graph (i.e., the news layer and the user layer) to extract multiple relations of news and users in social networks to derive rich information for detecting fake news. Based on the dual-layer graph, we propose a fake news detection model named Us-DeFake. It learns the propagation features of news in the news layer and the interaction features of users in the user layer. Through the inter-layer in the graph, Us-DeFake fuses the user signals that contain credibility information into the news features, to provide distinctive user-aware embeddings of news for fake news detection. The training process conducts on multiple dual-layer subgraphs obtained by a graph sampler to scale Us-DeFake in large scale social networks. Extensive experiments on real-world datasets illustrate the superiority of Us-DeFake which outperforms all baselines, and the users' credibility signals learned by interaction relation can notably improve the performance of our model.
translated by 谷歌翻译
Task-oriented dialogue systems often assist users with personal or confidential matters. For this reason, the developers of such a system are generally prohibited from observing actual usage. So how can they know where the system is failing and needs more training data or new functionality? In this work, we study ways in which realistic user utterances can be generated synthetically, to help increase the linguistic and functional coverage of the system, without compromising the privacy of actual users. To this end, we propose a two-stage Differentially Private (DP) generation method which first generates latent semantic parses, and then generates utterances based on the parses. Our proposed approach improves MAUVE by 3.8$\times$ and parse tree node-type overlap by 1.4$\times$ relative to current approaches for private synthetic data generation, improving both on fluency and semantic coverage. We further validate our approach on a realistic domain adaptation task of adding new functionality from private user data to a semantic parser, and show gains of 1.3$\times$ on its accuracy with the new feature.
translated by 谷歌翻译